The purpose of this memo is to explore data on Airbnb listings in Barcelona and understand the key drivers of price, from the viewpoint of 2 people looking to visit the city for 4 nights.
What is the best model to predict the price of 2 adults visiting Barcelona for 4 nights?
We started our project with Exploratory Data Analysis (EDA), which offered insights into the types, content, and relevance of the existing variables. We then progressed to our models, after checking for the correlation of our independent variables, in order to find an optimal regression model that will predict the price.
This section explores the steps we have taken to tidy and understand the data:
After selecting and adjusting relevant variables, we created summary statistics, charts, tables, and correlation matrices to visualise the data. During that, we concluded some correlation between [insert findings]
This section showcases our work on examining relevant independent variables and constructing a number of different regression models to identify the optimal model, which should be the one which explains the highest portion of variance in the price of Airbnb listings in Barcelona.
Predictor variables across the following categories: prop_type_simplified review_scores_rating room_type minimum_nights bathrooms bedrooms accommodates instant_bookable neighbourhood_simplified availability_30 reviews_per_month host_is_superhost host_response_rate host_acceptance_rate review_scores_cleanliness review_scores_checkin review_scores_location review_scores_value
Problematic collinearity: We include several variables from the same category in our price prediction model. There is no significant collinearity problem for our model, since, statistically, we have Square of GVIF^(1/(2*DF)) less than 5, which suggests there is no significant collinear relationships between our variables used for price prediction. However, there might be some further adjustments to out best model to improve the R-square for better price prediction, such as including more factors in some other different categories and also to make sure the collinearity of our variables become as small as possible.
The estimated stay for 2 adults spending four nights in Barcelona would cost is 835.0494, with 95% confidence interval of [512.9052,1613.538].
In our EDA, we will attempt to answer to following questions:
review_scores_rating: Average review score (0 - 100)We use glimpse() to have a look at the data. There are 16,206 observations across 74 variables and presumably we might not actually need all of them, because some might be overlapping and some irrelevant.
glimpse(listings)Rows: 16,206
Columns: 74
$ id <dbl> 18674, 23197, 32711, 3498~
$ listing_url <chr> "https://www.airbnb.com/r~
$ scrape_id <dbl> 2.021091e+13, 2.021091e+1~
$ last_scraped <date> 2021-09-10, 2021-09-11, ~
$ name <chr> "Huge flat for 8 people c~
$ description <chr> "110m2 apartment to rent ~
$ neighborhood_overview <chr> "Apartment in Barcelona l~
$ picture_url <chr> "https://a0.muscache.com/~
$ host_id <dbl> 71615, 90417, 135703, 731~
$ host_url <chr> "https://www.airbnb.com/u~
$ host_name <chr> "Mireia And Maria", "Etai~
$ host_since <date> 2010-01-19, 2010-03-09, ~
$ host_location <chr> "Barcelona, Catalonia, Sp~
$ host_about <chr> "We are Mireia (43) & Mar~
$ host_response_time <chr> "within an hour", "within~
$ host_response_rate <chr> "100%", "100%", "100%", "~
$ host_acceptance_rate <chr> "83%", "70%", "100%", "83~
$ host_is_superhost <lgl> FALSE, FALSE, FALSE, TRUE~
$ host_thumbnail_url <chr> "https://a0.muscache.com/~
$ host_picture_url <chr> "https://a0.muscache.com/~
$ host_neighbourhood <chr> "la Sagrada Família", "El~
$ host_listings_count <dbl> 35, 2, 5, 3, 4, 4, 4, 4, ~
$ host_total_listings_count <dbl> 35, 2, 5, 3, 4, 4, 4, 4, ~
$ host_verifications <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ neighbourhood <chr> "Barcelona, CT, Spain", "~
$ neighbourhood_cleansed <chr> "la Sagrada Família", "el~
$ neighbourhood_group_cleansed <chr> "Eixample", "Sant Martí",~
$ latitude <dbl> 41.40556, 41.41291, 41.40~
$ longitude <dbl> 2.17262, 2.22063, 2.17015~
$ property_type <chr> "Entire rental unit", "En~
$ room_type <chr> "Entire home/apt", "Entir~
$ accommodates <dbl> 8, 6, 6, 9, 2, 1, 1, 2, 1~
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text <chr> "2 baths", "2 baths", "1.~
$ bedrooms <dbl> 3, 3, 2, 4, 1, 1, 1, 1, 1~
$ beds <dbl> 6, 5, 3, 6, 1, 1, 1, 1, 1~
$ amenities <chr> "[\"TV\", \"Wifi\", \"Hai~
$ price <chr> "$121.00", "$229.00", "$1~
$ minimum_nights <dbl> 1, 4, 2, 4, 2, 2, 2, 2, 2~
$ maximum_nights <dbl> 1125, 300, 31, 365, 60, 6~
$ minimum_minimum_nights <dbl> 1, 4, 2, 3, 2, 2, 2, 2, 2~
$ maximum_minimum_nights <dbl> 3, 7, 2, 4, 2, 2, 2, 2, 2~
$ minimum_maximum_nights <dbl> 1125, 1125, 31, 365, 1125~
$ maximum_maximum_nights <dbl> 1125, 1125, 31, 365, 1125~
$ minimum_nights_avg_ntm <dbl> 1.6, 4.1, 2.0, 4.0, 2.0, ~
$ maximum_nights_avg_ntm <dbl> 1125, 1125, 31, 365, 1125~
$ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30 <dbl> 1, 22, 0, 15, 0, 0, 0, 0,~
$ availability_60 <dbl> 26, 52, 0, 36, 0, 0, 0, 0~
$ availability_90 <dbl> 50, 82, 0, 66, 0, 0, 0, 0~
$ availability_365 <dbl> 58, 128, 44, 156, 163, 16~
$ calendar_last_scraped <date> 2021-09-10, 2021-09-11, ~
$ number_of_reviews <dbl> 21, 52, 63, 154, 358, 309~
$ number_of_reviews_ltm <dbl> 0, 0, 0, 5, 45, 56, 39, 2~
$ number_of_reviews_l30d <dbl> 0, 0, 0, 2, 0, 0, 0, 0, 1~
$ first_review <date> 2014-03-02, 2016-01-05, ~
$ last_review <date> 2019-10-11, 2019-12-15, ~
$ review_scores_rating <dbl> 4.40, 4.74, 4.31, 4.52, 4~
$ review_scores_accuracy <dbl> 4.55, 4.92, 4.27, 4.66, 4~
$ review_scores_cleanliness <dbl> 4.75, 4.92, 4.38, 4.62, 4~
$ review_scores_checkin <dbl> 4.80, 4.94, 4.81, 4.65, 4~
$ review_scores_communication <dbl> 4.90, 4.98, 4.79, 4.67, 4~
$ review_scores_location <dbl> 4.75, 4.67, 4.81, 4.72, 4~
$ review_scores_value <dbl> 4.30, 4.69, 4.40, 4.48, 4~
$ license <chr> "HUTB-002062", "HUTB-0050~
$ instant_bookable <lgl> TRUE, FALSE, TRUE, FALSE,~
$ calculated_host_listings_count <dbl> 19, 2, 3, 2, 4, 4, 4, 4, ~
$ calculated_host_listings_count_entire_homes <dbl> 19, 2, 3, 2, 0, 0, 0, 0, ~
$ calculated_host_listings_count_private_rooms <dbl> 0, 0, 0, 0, 4, 4, 4, 4, 1~
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month <dbl> 0.23, 0.75, 0.61, 1.54, 4~
We use skim() to identify missing values, and get a better picture of the data set. We make the following observations: - Some numeric variables are listed as character - There are a lot of missing values for character and numeric variables - Especially for the ratings, a lot of values are missing - There are many irrelevant variables
skim(listings)| Name | listings |
| Number of rows | 16206 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| character | 25 |
| Date | 5 |
| logical | 7 |
| numeric | 37 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 34 | 37 | 0 | 16206 | 0 |
| name | 10 | 1.00 | 1 | 255 | 0 | 15662 | 0 |
| description | 118 | 0.99 | 1 | 1000 | 0 | 14728 | 0 |
| neighborhood_overview | 6556 | 0.60 | 1 | 1000 | 0 | 7139 | 0 |
| picture_url | 0 | 1.00 | 61 | 126 | 0 | 15925 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 7521 | 0 |
| host_name | 40 | 1.00 | 1 | 35 | 0 | 3469 | 0 |
| host_location | 52 | 1.00 | 2 | 166 | 0 | 609 | 0 |
| host_about | 6149 | 0.62 | 1 | 5756 | 0 | 3946 | 5 |
| host_response_time | 40 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 40 | 1.00 | 2 | 4 | 0 | 68 | 0 |
| host_acceptance_rate | 40 | 1.00 | 2 | 4 | 0 | 93 | 0 |
| host_thumbnail_url | 40 | 1.00 | 55 | 106 | 0 | 7477 | 0 |
| host_picture_url | 40 | 1.00 | 57 | 109 | 0 | 7477 | 0 |
| host_neighbourhood | 4898 | 0.70 | 3 | 33 | 0 | 131 | 0 |
| host_verifications | 0 | 1.00 | 2 | 156 | 0 | 276 | 0 |
| neighbourhood | 6556 | 0.60 | 5 | 55 | 0 | 74 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 5 | 44 | 0 | 72 | 0 |
| neighbourhood_group_cleansed | 0 | 1.00 | 6 | 19 | 0 | 10 | 0 |
| property_type | 0 | 1.00 | 4 | 35 | 0 | 55 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bathrooms_text | 19 | 1.00 | 6 | 17 | 0 | 36 | 0 |
| amenities | 0 | 1.00 | 2 | 2173 | 0 | 13656 | 0 |
| price | 0 | 1.00 | 5 | 9 | 0 | 530 | 0 |
| license | 6007 | 0.63 | 1 | 243 | 0 | 4894 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2021-09-10 | 2021-09-11 | 2021-09-11 | 2 |
| host_since | 40 | 1.00 | 2008-09-19 | 2021-08-24 | 2016-03-02 | 3004 |
| calendar_last_scraped | 0 | 1.00 | 2021-09-10 | 2021-09-11 | 2021-09-11 | 2 |
| first_review | 4417 | 0.73 | 2011-01-04 | 2021-09-10 | 2018-10-07 | 2702 |
| last_review | 4417 | 0.73 | 2011-06-23 | 2021-09-10 | 2020-02-15 | 1760 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 40 | 1 | 0.16 | FAL: 13562, TRU: 2604 |
| host_has_profile_pic | 40 | 1 | 1.00 | TRU: 16107, FAL: 59 |
| host_identity_verified | 40 | 1 | 0.78 | TRU: 12590, FAL: 3576 |
| bathrooms | 16206 | 0 | NaN | : |
| calendar_updated | 16206 | 0 | NaN | : |
| has_availability | 0 | 1 | 0.99 | TRU: 16066, FAL: 140 |
| instant_bookable | 0 | 1 | 0.45 | FAL: 8931, TRU: 7275 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.793740e+07 | 15720291.91 | 1.867400e+04 | 1.500257e+07 | 2.928164e+07 | 4.190451e+07 | 5.218653e+07 | <U+2586><U+2585><U+2586><U+2587><U+2587> |
| scrape_id | 0 | 1.00 | 2.021091e+13 | 0.00 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | 2.021091e+13 | <U+2581><U+2581><U+2587><U+2581><U+2581> |
| host_id | 0 | 1.00 | 1.176105e+08 | 122466396.31 | 3.073000e+03 | 9.024729e+06 | 6.134564e+07 | 2.160151e+08 | 4.199209e+08 | <U+2587><U+2582><U+2582><U+2582><U+2581> |
| host_listings_count | 40 | 1.00 | 2.520000e+01 | 117.79 | 0.000000e+00 | 1.000000e+00 | 3.000000e+00 | 1.600000e+01 | 2.178000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| host_total_listings_count | 40 | 1.00 | 2.520000e+01 | 117.79 | 0.000000e+00 | 1.000000e+00 | 3.000000e+00 | 1.600000e+01 | 2.178000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| latitude | 0 | 1.00 | 4.139000e+01 | 0.01 | 4.135000e+01 | 4.138000e+01 | 4.139000e+01 | 4.140000e+01 | 4.146000e+01 | <U+2581><U+2587><U+2585><U+2581><U+2581> |
| longitude | 0 | 1.00 | 2.170000e+00 | 0.02 | 2.090000e+00 | 2.160000e+00 | 2.170000e+00 | 2.180000e+00 | 2.230000e+00 | <U+2581><U+2582><U+2587><U+2585><U+2581> |
| accommodates | 0 | 1.00 | 3.400000e+00 | 2.24 | 0.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.600000e+01 | <U+2587><U+2585><U+2581><U+2581><U+2581> |
| bedrooms | 558 | 0.97 | 1.680000e+00 | 1.02 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.600000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 306 | 0.98 | 2.340000e+00 | 1.94 | 0.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1.00 | 1.348000e+01 | 32.94 | 1.000000e+00 | 1.000000e+00 | 3.000000e+00 | 3.000000e+01 | 1.124000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights | 0 | 1.00 | 7.067200e+02 | 492.75 | 1.000000e+00 | 1.200000e+02 | 1.125000e+03 | 1.125000e+03 | 3.000000e+03 | <U+2586><U+2587><U+2581><U+2581><U+2581> |
| minimum_minimum_nights | 1 | 1.00 | 1.376000e+01 | 33.33 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.000000e+01 | 1.124000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_minimum_nights | 1 | 1.00 | 1.589000e+01 | 35.24 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 3.000000e+01 | 1.124000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_maximum_nights | 1 | 1.00 | 8.019400e+02 | 538.78 | 1.000000e+00 | 3.300000e+02 | 1.125000e+03 | 1.125000e+03 | 9.999000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_maximum_nights | 1 | 1.00 | 2.658599e+05 | 23856502.75 | 1.000000e+00 | 3.600000e+02 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights_avg_ntm | 1 | 1.00 | 1.473000e+01 | 34.14 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 3.000000e+01 | 1.124000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights_avg_ntm | 1 | 1.00 | 2.652532e+05 | 23802528.85 | 1.000000e+00 | 3.450000e+02 | 1.125000e+03 | 1.125000e+03 | 2.142625e+09 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_30 | 0 | 1.00 | 9.230000e+00 | 11.07 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 1.800000e+01 | 3.000000e+01 | <U+2587><U+2582><U+2581><U+2581><U+2582> |
| availability_60 | 0 | 1.00 | 2.273000e+01 | 23.23 | 0.000000e+00 | 0.000000e+00 | 1.500000e+01 | 4.500000e+01 | 6.000000e+01 | <U+2587><U+2581><U+2582><U+2582><U+2583> |
| availability_90 | 0 | 1.00 | 3.871000e+01 | 35.80 | 0.000000e+00 | 0.000000e+00 | 3.900000e+01 | 7.400000e+01 | 9.000000e+01 | <U+2587><U+2581><U+2582><U+2583><U+2585> |
| availability_365 | 0 | 1.00 | 1.648600e+02 | 136.38 | 0.000000e+00 | 9.000000e+00 | 1.515000e+02 | 3.080000e+02 | 3.650000e+02 | <U+2587><U+2583><U+2582><U+2583><U+2586> |
| number_of_reviews | 0 | 1.00 | 3.340000e+01 | 66.33 | 0.000000e+00 | 0.000000e+00 | 4.000000e+00 | 3.400000e+01 | 8.530000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews_ltm | 0 | 1.00 | 3.140000e+00 | 7.65 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 1.450000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews_l30d | 0 | 1.00 | 6.500000e-01 | 1.76 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.040000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 4417 | 0.73 | 4.500000e+00 | 0.75 | 0.000000e+00 | 4.400000e+00 | 4.670000e+00 | 4.890000e+00 | 5.000000e+00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_accuracy | 4582 | 0.72 | 4.640000e+00 | 0.54 | 0.000000e+00 | 4.550000e+00 | 4.800000e+00 | 4.960000e+00 | 5.000000e+00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_cleanliness | 4579 | 0.72 | 4.580000e+00 | 0.56 | 0.000000e+00 | 4.470000e+00 | 4.730000e+00 | 4.930000e+00 | 5.000000e+00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_checkin | 4585 | 0.72 | 4.740000e+00 | 0.48 | 0.000000e+00 | 4.690000e+00 | 4.890000e+00 | 5.000000e+00 | 5.000000e+00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_communication | 4580 | 0.72 | 4.730000e+00 | 0.50 | 0.000000e+00 | 4.670000e+00 | 4.880000e+00 | 5.000000e+00 | 5.000000e+00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_location | 4586 | 0.72 | 4.750000e+00 | 0.39 | 0.000000e+00 | 4.680000e+00 | 4.860000e+00 | 5.000000e+00 | 5.000000e+00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_value | 4587 | 0.72 | 4.500000e+00 | 0.54 | 0.000000e+00 | 4.360000e+00 | 4.620000e+00 | 4.810000e+00 | 5.000000e+00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| calculated_host_listings_count | 0 | 1.00 | 1.697000e+01 | 32.72 | 1.000000e+00 | 1.000000e+00 | 3.000000e+00 | 1.600000e+01 | 1.770000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 1.445000e+01 | 32.26 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.200000e+01 | 1.770000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 2.320000e+00 | 9.07 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 9.000000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 8.000000e-02 | 0.82 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.500000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| reviews_per_month | 4417 | 0.73 | 1.160000e+00 | 1.42 | 1.000000e-02 | 1.800000e-01 | 6.900000e-01 | 1.660000e+00 | 2.400000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Some of the pieces of data are of the wrong type or contain irrelevant data. For example, “price” is not only a “character” variable, but also contains the troublesome “$” symbol, which must be removed before proper analysis can begin. Also, the “bathrooms_text” data seems to be of use to our analysis, but first we need to once again turn it into a double-type data and remove the pesky text from it.
Some of the listings are located in irrelevant cities or countries. The “host_location” variable has observations of listings which are not located in Barcelona, and as such we must filter out pieces of data from irrelevant locations in the data frame.
There are some variables that seem rather irrelevant. For example, the “last_scraped” data, “name” and other such data seem a bit useless for analytical purposes, so we have to filter them out.
There are NA values which we must omit for our analysis to be as accurate as possible.
For the convenience of coding, comparing data and analysing, we converted the following 4 variables to numeric: price, bathrooms_text, host_response_rate, host_acceptance_rate.
We created “property_type” by grouping data into 5 categories based on frequency distribution to control from the largest subgroups. Our 5 categories are “Entire rental unit”,“Private room in rental unit”, “Entire serviced apartment”,“Entire condominium (condo)”, “Other”.
We created “neighbourhood_group_cleansed” by categorizing neighbourhoods into 4 different categories: North, Center, Coastal Line and West.
In addition, we created 5 data frames: host_specific, property_specific, reviews_specific, logical_specific, categorical_specific. For each of the 5 data frames, we have an overview of distribution and characteristics of each variable to see which variables might have the strongest power of predicting the price.
In the correlation matrix, blue represents for positive correlation and red represents for negative correlation. The darker the color is, the stronger the correlation between the variables. From our correlation matrix graph, we can see that the correlations between any two of bedroom, beds and accommodates are positive and relatively high. Also, another group of variables has relatively high positive correlation is the review-related variables, such as review scores cleanliness, review scores checkin, review scores communication, review scores location and review scores value. I think it is reasonable since these variables all describe reviews so they are somehow link to each other. Otherwise, variables have relatively weak negative correlation as indicated by very light red colour.
Price is the dependent variable, while factors that might influence the price and people’s preferences regarding the Barcelona AirBnB are independent variable, for example, property type, number of reviews, neighbourhood and so on.
We start out by cleansing out the entire data frame for only entries in which “Barcelona” is mentioned so that henceforth we know we are only wrangling relevant geographical data.
barcelona <- listings %>%
filter(host_location %in% c("Barcelona",
"BARCELONA",
"Barcelona, Barcelona, Spain",
"Barcelona, BARCELONA, Spain",
"Barcelona, Catalonia, Spain",
"Barcelona, Cataluña, Spain",
"Barcelona, Catalunya, Spain")) Having done this, we can now filter out the data frame for specific variables that we wish to analyse, turn the problematic pieces of data into their correct data types, and remove the NA data from our frame (thus fixing the other issues mentioned above). After thorough consideration, we decided that the factors which are most interesting and relevant to our EDA are in relation to the characteristics of the hosts themselves and their properties.
barcelona <- listings %>%
select(
id,
host_response_rate,
host_acceptance_rate,
host_is_superhost,
host_listings_count,
host_has_profile_pic,
host_identity_verified,
neighbourhood_cleansed,
has_availability,
instant_bookable,
neighbourhood_group_cleansed,
latitude,
longitude,
property_type,
room_type,
property_type,
accommodates,
bathrooms_text,
bedrooms,
beds,
price,
minimum_nights,
maximum_nights,
availability_30,
number_of_reviews,
review_scores_rating,
review_scores_accuracy,
review_scores_cleanliness,
review_scores_checkin,
review_scores_communication,
review_scores_location,
review_scores_value,
reviews_per_month,
instant_bookable
)
# We convert the price from character to numeric
barcelona$price <- as.numeric(gsub("\\$","",barcelona$price))
#We convert bathrooms_text into a variable with an integer number of bathrooms
barcelona$bathrooms <- substr(barcelona$bathrooms_text, 1,2)
barcelona$bathrooms <- as.numeric(gsub("\\.","",barcelona$bathrooms))
#We convert the host_response_rate variable to numeric
barcelona$host_response_rate <- as.numeric(gsub("\\%","",barcelona$host_response_rate))
barcelona$host_acceptance_rate <- as.numeric(gsub("\\%","",barcelona$host_acceptance_rate))
#We convert the host_response_rate variable to numeric
barcelona$host_acceptance_rate <- as.numeric(gsub("\\%","",barcelona$host_acceptance_rate))
barcelona$host_acceptance_rate <- as.numeric(gsub("\\%","",barcelona$host_acceptance_rate))
#omit NA variables
barcelona <- na.omit(barcelona)
skim(barcelona)| Name | barcelona |
| Number of rows | 8159 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| logical | 5 |
| numeric | 23 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| neighbourhood_cleansed | 0 | 1 | 5 | 44 | 0 | 69 | 0 |
| neighbourhood_group_cleansed | 0 | 1 | 6 | 19 | 0 | 10 | 0 |
| property_type | 0 | 1 | 4 | 35 | 0 | 46 | 0 |
| room_type | 0 | 1 | 10 | 15 | 0 | 4 | 0 |
| bathrooms_text | 0 | 1 | 6 | 16 | 0 | 26 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 0 | 1 | 0.22 | FAL: 6342, TRU: 1817 |
| host_has_profile_pic | 0 | 1 | 1.00 | TRU: 8148, FAL: 11 |
| host_identity_verified | 0 | 1 | 0.90 | TRU: 7321, FAL: 838 |
| has_availability | 0 | 1 | 1.00 | TRU: 8154, FAL: 5 |
| instant_bookable | 0 | 1 | 0.48 | FAL: 4229, TRU: 3930 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 24747418.28 | 16027052.03 | 18674.00 | 9825059.50 | 24344024.00 | 39353745.50 | 51939670.00 | <U+2587><U+2585><U+2586><U+2586><U+2586> |
| host_response_rate | 0 | 1 | 92.23 | 17.00 | 0.00 | 92.00 | 100.00 | 100.00 | 100.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| host_acceptance_rate | 0 | 1 | 85.73 | 24.48 | 0.00 | 83.00 | 97.00 | 100.00 | 100.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| host_listings_count | 0 | 1 | 22.15 | 40.71 | 0.00 | 2.00 | 5.00 | 21.00 | 411.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| latitude | 0 | 1 | 41.39 | 0.01 | 41.35 | 41.38 | 41.39 | 41.40 | 41.46 | <U+2581><U+2587><U+2585><U+2581><U+2581> |
| longitude | 0 | 1 | 2.17 | 0.02 | 2.09 | 2.16 | 2.17 | 2.18 | 2.23 | <U+2581><U+2582><U+2587><U+2585><U+2581> |
| accommodates | 0 | 1 | 3.97 | 2.33 | 1.00 | 2.00 | 4.00 | 5.00 | 16.00 | <U+2587><U+2583><U+2581><U+2581><U+2581> |
| bedrooms | 0 | 1 | 1.85 | 1.04 | 1.00 | 1.00 | 2.00 | 2.00 | 10.00 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| beds | 0 | 1 | 2.71 | 2.03 | 0.00 | 1.00 | 2.00 | 4.00 | 20.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| price | 0 | 1 | 107.37 | 89.89 | 9.00 | 47.00 | 91.00 | 134.00 | 999.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1 | 9.38 | 16.73 | 1.00 | 1.00 | 3.00 | 5.00 | 600.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights | 0 | 1 | 740.52 | 482.78 | 1.00 | 180.00 | 1125.00 | 1125.00 | 3000.00 | <U+2585><U+2587><U+2581><U+2581><U+2581> |
| availability_30 | 0 | 1 | 10.17 | 10.12 | 0.00 | 0.00 | 8.00 | 18.00 | 30.00 | <U+2587><U+2583><U+2582><U+2582><U+2582> |
| number_of_reviews | 0 | 1 | 52.33 | 78.92 | 1.00 | 4.00 | 18.00 | 68.00 | 853.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 0 | 1 | 4.55 | 0.51 | 1.00 | 4.41 | 4.67 | 4.87 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_accuracy | 0 | 1 | 4.64 | 0.51 | 1.00 | 4.53 | 4.78 | 4.93 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_cleanliness | 0 | 1 | 4.59 | 0.52 | 1.00 | 4.47 | 4.72 | 4.91 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_checkin | 0 | 1 | 4.72 | 0.46 | 1.00 | 4.67 | 4.87 | 5.00 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_communication | 0 | 1 | 4.71 | 0.49 | 1.00 | 4.66 | 4.86 | 5.00 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_location | 0 | 1 | 4.76 | 0.36 | 1.00 | 4.69 | 4.86 | 5.00 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_value | 0 | 1 | 4.49 | 0.52 | 1.00 | 4.34 | 4.60 | 4.79 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| reviews_per_month | 0 | 1 | 1.35 | 1.51 | 0.01 | 0.30 | 0.93 | 1.91 | 24.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bathrooms | 0 | 1 | 1.33 | 0.62 | 0.00 | 1.00 | 1.00 | 2.00 | 11.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Having now pulled the specific variables we want to analyse in our data frame and tidy it up a bit, we then proceed to separate property types based on specific subgroups.
#we group the property types into 5 categories according to frequency distribution to only control for the largest subgroups
barcelona <- barcelona %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Entire rental unit","Private room in rental unit", "Entire serviced apartment","Entire condominium (condo)") ~ property_type,
TRUE ~ "Other"
))
#We check whether the categorization has been performed correctly
barcelona %>%
count(property_type, prop_type_simplified) %>%
arrange(desc(n)) %>%
kbl(col.names = c("Property Type", "Category", "Count"),caption = "Count by Property Type") %>%
kable_styling() | Property Type | Category | Count |
|---|---|---|
| Entire rental unit | Entire rental unit | 4846 |
| Private room in rental unit | Private room in rental unit | 1929 |
| Entire serviced apartment | Entire serviced apartment | 271 |
| Entire condominium (condo) | Entire condominium (condo) | 155 |
| Entire loft | Other | 129 |
| Room in boutique hotel | Other | 123 |
| Private room in condominium (condo) | Other | 103 |
| Private room in hostel | Other | 85 |
| Private room in residential home | Other | 65 |
| Room in hotel | Other | 64 |
| Shared room in hostel | Other | 55 |
| Entire residential home | Other | 39 |
| Private room in loft | Other | 30 |
| Private room in bed and breakfast | Other | 27 |
| Entire guest suite | Other | 26 |
| Private room in guest suite | Other | 25 |
| Room in hostel | Other | 25 |
| Room in serviced apartment | Other | 23 |
| Shared room in rental unit | Other | 15 |
| Private room in serviced apartment | Other | 14 |
| Boat | Other | 12 |
| Entire townhouse | Other | 12 |
| Entire guesthouse | Other | 11 |
| Private room in casa particular | Other | 11 |
| Entire villa | Other | 7 |
| Private room | Other | 7 |
| Room in aparthotel | Other | 5 |
| Room in bed and breakfast | Other | 5 |
| Shared room in bed and breakfast | Other | 5 |
| Private room in dome house | Other | 4 |
| Private room in floor | Other | 4 |
| Private room in townhouse | Other | 4 |
| Private room in villa | Other | 4 |
| Private room in chalet | Other | 3 |
| Private room in guesthouse | Other | 3 |
| Casa particular | Other | 2 |
| Tiny house | Other | 2 |
| Barn | Other | 1 |
| Entire home/apt | Other | 1 |
| Entire place | Other | 1 |
| Private room in boat | Other | 1 |
| Private room in farm stay | Other | 1 |
| Shared room in guesthouse | Other | 1 |
| Shared room in hotel | Other | 1 |
| Shared room in loft | Other | 1 |
| Shared room in serviced apartment | Other | 1 |
#We exclude listings that have minimum nights requirement greater than 4
barcelona <- barcelona %>%
filter(minimum_nights<=4)
#Here we categorize the grouped neighbourhoods in Barcelona in 4 different categories: North, Center, Coastal Line, and West
barcelona <- barcelona %>%
mutate(neighbourhood_simplified = case_when(
neighbourhood_group_cleansed %in% c("Horta-Guinardó","Nou Barris","Sarrià-Sant Gervasi") ~ "North",
neighbourhood_group_cleansed %in% c("Eixample", "Gracia","Ciutat Vella") ~ "Center",
neighbourhood_group_cleansed %in% c("Sant Martí","Sants-Montjuïc") ~ "Coastal line outside Center",
TRUE ~ "West"
))
barcelona %>%
count(neighbourhood_group_cleansed, neighbourhood_simplified) %>%
arrange(desc(n))%>%
kbl(col.names = c("Neighbourhood", "Neighbourhood's Category", "Count"),caption = "Count and Category by Neighbourhood") %>%
kable_styling()| Neighbourhood | Neighbourhood’s Category | Count |
|---|---|---|
| Eixample | Center | 2670 |
| Ciutat Vella | Center | 968 |
| Sants-Montjuïc | Coastal line outside Center | 689 |
| Sant Martí | Coastal line outside Center | 599 |
| Gràcia | West | 518 |
| Sarrià-Sant Gervasi | North | 186 |
| Horta-Guinardó | North | 160 |
| Les Corts | West | 108 |
| Sant Andreu | West | 57 |
| Nou Barris | North | 45 |
Now, to make our analysis more comprehensive, we further split our data frame into 5 different categories:
host_specific <- barcelona %>%
select(host_response_rate,
host_acceptance_rate,
host_listings_count
)
property_specific <- barcelona %>%
select(accommodates,
bedrooms,
beds,
bathrooms,
)
reviews_specific <- barcelona %>%
select(number_of_reviews,
review_scores_accuracy,
review_scores_checkin,
review_scores_cleanliness,
review_scores_location,
review_scores_communication,
review_scores_value,
review_scores_rating
)
logical_specific <- barcelona %>%
select(host_is_superhost,
host_has_profile_pic,
host_identity_verified,
has_availability,
instant_bookable)
categorical_specific <- barcelona %>%
select(neighbourhood_group_cleansed,
prop_type_simplified,
room_type)Firstly, we look at the property-related variables, mutate the data frames we created for these to include new variables explaining how well the properties accommodate guests in terms of bathrooms, and then create density charts to visualise the sizes of the properties which we are exploring.
property_specific <- property_specific %>%
mutate(bathrooms_per_guest=bathrooms/accommodates,
bedrooms_per_guest=bedrooms/accommodates)
property_specific_longer <- property_specific %>%
select(bathrooms_per_guest,
bedrooms_per_guest,
bathrooms,
bedrooms,
beds,
accommodates) %>%
pivot_longer(names_to="variable_name", values_to="values",everything())
ggplot(property_specific_longer, aes(x=values), na.rm=TRUE)+
geom_density(fill="grey")+
facet_wrap(vars(variable_name), scales="free", ncol=3)+
labs(title="Property related variables affecting prices of Airbnb in Barcelona", x="", y="Density")+
theme(axis.title = element_text()) +
theme(axis.text.y=element_blank())+
theme_bw()+
NULLggpairs(property_specific)Looking at the distribution, we see that beds and accommodates seem to have a similar distribution. Thus, they might be correlated, if that’s the case we should only use one of them for our model. The other distributions are not unified, thus they might have an influence on prices.
In general, we can see a high positive correlation between bedrooms and beds, roughly 0.8, which suggests that we would better include one of them at a time for our model prediction. There is relatively low negative correlation between bedrooms/ bathrooms and number of bedrooms/ bathrooms per guest. Thus, we could include both at the same time in our model.
barcelona$host_response_rate <- barcelona$host_response_rate/100
barcelona$host_acceptance_rate <- barcelona$host_acceptance_rate/100
host_specific <- barcelona %>%
select(host_response_rate,
host_acceptance_rate,
host_listings_count,
) %>%
pivot_longer(names_to= "data_names", values_to="data_values", everything())
ggplot(host_specific, aes(x = data_values), na.rm=TRUE) +
geom_density(fill = "blue") +
facet_wrap(vars(data_names), scales="free") +
labs(x = "", y = "Density", title = "Host-Specific Variables affecting Prices of AirBnB in Barcelona") +
theme_bw() +
NULLggpairs(host_specific)Looking at these charts, we can see that the listings count is very unified, thus it might not influence prices. Acceptance rate seems to be the least unified, hence it could be interesting for our regression model. Last, we can note that the response rate shows a similar pattern to the acceptance rate, so they might be correlated.
reviews_specific %>%
pivot_longer(names_to= "data_names", values_to="data_values", everything())%>%
ggplot(aes(x = data_values), na.rm=TRUE) +
geom_density(fill = "red",alpha = 0.4) +
facet_wrap(vars(data_names), scales="free") +
labs(x = "", y = "Density", title = "Reviews-Specific Variables affecting Prices of AirBnB in Barcelona") +
theme_bw() +
NULLggpairs(reviews_specific)All the distributions are extremely skewed to the left. Most of them are very unified and might thus not be interesting to include in our model. The least unified variables that could be considered are: review score cleanliness, review score rating and review score value.
From the ggpair, we see there are high positive correlations among pairs of any two of the reviews_specific variables, except for the very first row, number of reviews. This is reasonable because review_scores_cleanliness, review_scores_communication, review_scores_value and review_scores_raing are somehow related to each other. Thus, we would better to include only one of these variables in our model prediction. (Note that, although we include both review_scores_rating and review_scores_value in our best model, we would check their collinearity again when we build our model to ensure there is no significant collinear relationships between our variables.)
#rename logical variables
colnames(logical_specific) <- c("Host is a Superhost", "Host has a Profile Picture", "Host's Identity is Verified", "Property is Available", "Property is Instantly Bookable")
#create table to better visualise logical variables
logical_visualisation <- logical_specific %>%
#pivot table to perform next steps
pivot_longer(cols = 1:5,
names_to = "var",
values_to = "logical") %>%
#group by variable and logical result (True or False)
group_by(var, logical) %>%
#count number of True and False for each variable
summarise(count = n()) %>%
#pivot wider to have one row per variable and one column for each "True" and "False" count
pivot_wider(names_from = "logical",
values_from = "count") %>%
#Format table with kableExtra, renmaing columns, adding title
kbl(col.names = c("Logical Variable", "Count of True", "Count of False"), caption = "Count of True and False by Logical Variable")%>%
kable_styling()
#display table
logical_visualisation| Logical Variable | Count of True | Count of False |
|---|---|---|
| Host’s Identity is Verified | 541 | 5459 |
| Host has a Profile Picture | 7 | 5993 |
| Host is a Superhost | 4720 | 1280 |
| Property is Available | 5 | 5995 |
| Property is Instantly Bookable | 2787 | 3213 |
Looking at the above table and considering what each variable represents we can conclude that: - most flats are not available, this potentially reflects a high demand and could influence prices (for instance the cheapest flats or the flats with the best price/quality balance are probably already booked) - a majority (~55%) of flats are instantly bookable, however, the distribution of True vs False is the most balanced of all variables. Consequently, this could have an important impact on the price. - almost all the hosts have a profile picture, thus this probably has no impact on the price. - a large majority (~80%) of hosts are superhosts, considering that superhosts are hosts who received great ratings, this could have a slight impact on the price. - a large majority (~80%) of hosts’ identities are verified, considering this is an indicator of security, it could slightly influence the price.
categorical_specific_longer <- categorical_specific %>%
pivot_longer(names_to="variable_name", values_to="values",everything()) %>%
group_by(variable_name,values) %>%
mutate(observations=n())
ggplot(categorical_specific_longer, aes(y=reorder(values,observations)),
na.rm=TRUE)+
geom_bar(fill="grey",orientation = "y")+
facet_wrap(vars(variable_name), scales="free", ncol=1)+
labs(title="Categorical variables affecting prices of Airbnb in Barcelona",
x="", y="Density")+
theme_bw()+
NULLWe can see that none of the variable is unified, thus they all have the potential to be useful in predicting variability in prices.
#we set the color for price to red and create the palette
price_color <- colorNumeric(palette = "Reds",
domain = c(1:200),
reverse = FALSE)
#We will use the listings data frame for the mapping, so we have to convert price to numeric here as well
listings$price <- as.numeric(gsub("\\$","",listings$price))
#Adding circle markers with popups and colors depending on price range
leaflet(data = filter(listings, minimum_nights <= 4)) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~price_color(price),
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)There is a direct correlation between property prices and proximity to the city center. There appear to be clusters of properties with distinctively higher prices, which could be explained by proximity to touristic locations or transportation connections
location_clusters <-leaflet(data = filter(listings, minimum_nights <= 4)) %>%
# Map with OpenStreetMap.Mapnik
addProviderTiles("OpenStreetMap.Mapnik") %>%
# Add circle markers with popups and labels
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillOpacity=0.6,
popup = ~listing_url,
label = ~property_type,
# cluster
clusterOptions = markerClusterOptions())
#print location_clusters
location_clustersThere appears to be a direct correlation between the number of properties and the proximity to the city center. Largest clustering in “Ciutat Vela” and ”Eixample” which are known mainly for their popular tourist spots and renowned architecture. At the same time, areas closer to the center boast easier transportation connections to most Barcelona neighbourhoods and viccinity to nightlife avenues.
We create price variable price_4_nights and log it, for our model regression.
barcelona_analysis <- barcelona %>%
#filter for 2 people
filter(accommodates >= 2) %>%
#calculate price for 4 nights
mutate(price_4_nights = price * 4,
logprice_4_nights = log(price_4_nights))
#create desnity plot
ggplot(data = barcelona_analysis, aes(price_4_nights)) +
geom_density()+
#add title and axis titles
labs(title = "Density Plot of the Price for 2 People for 4 Nights",
y = "Density",
x = "Price for 2 for 4 Nights")+
#change theme
theme_bw()+
NULL#create density plot with logarithmic scale
ggplot(data = barcelona_analysis, aes(logprice_4_nights)) +
geom_density()+
#add title and axis titles
labs(title = "Density Plot of the Price for 2 People for 4 Nights - Logarithmic Scale",
y = "Density - Logarithmic",
x = "Price for 2 for 4 Nights")+
#change theme
theme_bw()+
NULLglimpse(barcelona)Rows: 6,000
Columns: 35
$ id <dbl> 18674, 23197, 32711, 34981, 35379, 35388,~
$ host_response_rate <dbl> 1.00, 1.00, 1.00, 1.00, 0.90, 0.90, 0.90,~
$ host_acceptance_rate <dbl> 0.83, 0.70, 1.00, 0.83, 1.00, 1.00, 1.00,~
$ host_is_superhost <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TR~
$ host_listings_count <dbl> 35, 2, 5, 3, 4, 4, 4, 4, 1, 8, 5, 4, 10, ~
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,~
$ host_identity_verified <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,~
$ neighbourhood_cleansed <chr> "la Sagrada Família", "el Besòs i el Mare~
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,~
$ instant_bookable <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU~
$ neighbourhood_group_cleansed <chr> "Eixample", "Sant Martí", "Gràcia", "Ciut~
$ latitude <dbl> 41.40556, 41.41291, 41.40566, 41.37978, 4~
$ longitude <dbl> 2.17262, 2.22063, 2.17015, 2.17623, 2.150~
$ property_type <chr> "Entire rental unit", "Entire rental unit~
$ room_type <chr> "Entire home/apt", "Entire home/apt", "En~
$ accommodates <dbl> 8, 6, 6, 9, 2, 1, 1, 2, 1, 4, 6, 6, 2, 3,~
$ bathrooms_text <chr> "2 baths", "2 baths", "1.5 baths", "3 bat~
$ bedrooms <dbl> 3, 3, 2, 4, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1,~
$ beds <dbl> 6, 5, 3, 6, 1, 1, 1, 1, 1, 1, 0, 4, 1, 1,~
$ price <dbl> 121, 229, 144, 189, 41, 31, 31, 46, 36, 5~
$ minimum_nights <dbl> 1, 4, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,~
$ maximum_nights <dbl> 1125, 300, 31, 365, 60, 60, 60, 60, 65, 3~
$ availability_30 <dbl> 1, 22, 0, 15, 0, 0, 0, 0, 14, 2, 0, 0, 20~
$ number_of_reviews <dbl> 21, 52, 63, 154, 358, 309, 310, 260, 79, ~
$ review_scores_rating <dbl> 4.40, 4.74, 4.31, 4.52, 4.72, 4.77, 4.83,~
$ review_scores_accuracy <dbl> 4.55, 4.92, 4.27, 4.66, 4.83, 4.84, 4.92,~
$ review_scores_cleanliness <dbl> 4.75, 4.92, 4.38, 4.62, 4.65, 4.76, 4.82,~
$ review_scores_checkin <dbl> 4.80, 4.94, 4.81, 4.65, 4.92, 4.90, 4.93,~
$ review_scores_communication <dbl> 4.90, 4.98, 4.79, 4.67, 4.93, 4.92, 4.91,~
$ review_scores_location <dbl> 4.75, 4.67, 4.81, 4.72, 4.77, 4.71, 4.85,~
$ review_scores_value <dbl> 4.30, 4.69, 4.40, 4.48, 4.74, 4.78, 4.83,~
$ reviews_per_month <dbl> 0.23, 0.75, 0.61, 1.54, 4.06, 3.49, 3.61,~
$ bathrooms <dbl> 2, 2, 1, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1,~
$ prop_type_simplified <chr> "Entire rental unit", "Entire rental unit~
$ neighbourhood_simplified <chr> "Center", "Coastal line outside Center", ~
correlation_matrix_numeric <- barcelona %>%
select(host_acceptance_rate, host_listings_count, latitude, longitude, accommodates, bedrooms, beds, price, minimum_nights, maximum_nights, availability_30, number_of_reviews, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, reviews_per_month, bathrooms) %>%
cor()
corrplot(correlation_matrix_numeric, is.corr = FALSE,
method="color",
type="full",
title="Correlation matrix of numerical variables",
tl.cex=.6,
tl.col="black",
cl.ratio=.3)Correlation <- cor(correlation_matrix_numeric , use="pairwise.complete.obs")set.seed(10)
# We create train_test_barcelona that is 75% for training and 25% for testing
train_test_barcelona <- initial_split(barcelona_analysis, prop = 0.75)
airbnb_listing_train <- training(train_test_barcelona)
airbnb_listing_test <- testing(train_test_barcelona)First, we fit a regression model called model1 with logprice_4_nights as the outcome variable and prop_type_simplified, number_of_reviews, and review_scores_ratingas explanatory variables.
model1 <- lm(logprice_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating, data = barcelona_analysis)
car::vif(model1) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.018706 4 1.002319
number_of_reviews 1.047886 1 1.023663
review_scores_rating 1.046606 1 1.023038
model1 %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate),
std.error = exp(std.error),
conf.low=exp(conf.low),
conf.high=exp(conf.high)
) %>%
kbl(col.names=c("Variable",
"Estimate",
"SE",
"t-stat",
"p-value",
"Lower CI",
"Upper CI")
) %>%
kable_styling()| Variable | Estimate | SE | t-stat | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 341.7275137 | 1.084801 | 71.6735262 | 0.0000000 | 291.3261850 | 400.8486007 |
| prop_type_simplifiedEntire rental unit | 0.9783626 | 1.050812 | -0.4413504 | 0.6589765 | 0.8877732 | 1.0781959 |
| prop_type_simplifiedEntire serviced apartment | 1.3557471 | 1.060666 | 5.1675163 | 0.0000002 | 1.2079092 | 1.5216792 |
| prop_type_simplifiedOther | 0.6453538 | 1.054056 | -8.3188813 | 0.0000000 | 0.5820706 | 0.7155173 |
| prop_type_simplifiedPrivate room in rental unit | 0.3385796 | 1.052435 | -21.1910134 | 0.0000000 | 0.3063017 | 0.3742589 |
| number_of_reviews | 0.9992673 | 1.000078 | -9.4143290 | 0.0000000 | 0.9991148 | 0.9994198 |
| review_scores_rating | 1.1153453 | 1.014766 | 7.4472350 | 0.0000000 | 1.0837508 | 1.1478609 |
Considering that we use logprice_4_nights as our outcome variable. We use exp() to get the estimate, standard error and confidence interval for price_4_nights.
According to the p-value, all the explanatory variables except prop_type_simplifiedEntire rental unit are significant.
review_scores_rating - Coefficient of review_scores_rating in terms of price_4_nights is 1.1153. If review_scores_rating increases by 1, price_4_nights will increase by 11.53%.
prop_type_simplified - Coefficient of prop_type_simplifiedEntire rental unit, in terms of price_4_nights is 0.9784. If property type changes from Entire condominium (condo) to Entire rental unit, price_4_nights will decrease by 2.16%. - Coefficient of prop_type_simplifiedEntire serviced apartment, in terms of price_4_nights is 1.3557. If property type changes from Entire condominium (condo) to Entire serviced apartment, price_4_nights will increase by 35.57%. - Coefficient of prop_type_simplifiedOther, in terms of price_4_nights is 0.6454. If property type changes from Entire condominium (condo) to Other, price_4_nights will decrease by 35.46%. - Coefficient of prop_type_simplifiedPrivate room in rental unit, in terms of price_4_nights is 0.3386. If property type changes from Entire condominium (condo) to Private room in rental unit, price_4_nights will decrease by 66.14%.
model1 %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("R Squared",
"Adj. R Squared",
"Sigma",
"t-stat",
"p-value",
"Df")
) %>%
kable_styling()| R Squared | Adj. R Squared | Sigma | t-stat | p-value | Df |
|---|---|---|---|---|---|
| 0.4401521 | 0.4395462 | 0.4863421 | 726.4484 | 0 | 6 |
autoplot(model1,
alpha = 0.2,
label.size = 3) +
theme_minimal()The adjusted Rsquared of model1 is 0.4395, which means that the selected variables are able to explain 43.95% of the variation in prices. This is a good start but we will run other models and test other variables in order to improve this. Residuals of model1 do not obey Normal Distribution well, we can see some patterns, which let us think that we might be missing key explanatory variables for the prediction of the price.
#Constructing our predictions for the model and RMSE train from training data
train_rmse <-
airbnb_listing_train %>%
mutate(predictions = exp(predict(model1, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
airbnb_listing_test %>%
mutate(predictions = exp(predict(model1, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Training model output on RSquared
train_r_squared <-
summary(model1)$r.squared
#Testing model output on Rsquared
test_r_squared <-
cor(predict(model1, airbnb_listing_test), airbnb_listing_test$price_4_nights)
#Creating a matrix from output values
kbl(matrix(c(train_rmse,
test_rmse,
train_r_squared,
test_r_squared),
nrow = 2,
dimnames = list(c("Training","Testing"),
c("RMSE","RSquared"))),
) %>%
kable_styling()| RMSE | RSquared | |
|---|---|---|
| Training | 3736.565 | 0.4401521 |
| Testing | 2676.658 | 0.4070132 |
Difference of RMSE between training data set and testing data set is not very large and the training Rsquared is higher than the testing one.
Then, we want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. We fit model2 with a regression model called model1 with logprice_4_nights as the outcome variable and prop_type_simplified, number_of_reviews, review_scores_rating and room_rype as explanatory variables.
model2 <- lm(logprice_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type, data = barcelona_analysis)
car::vif(model2) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 12.946270 4 1.377267
number_of_reviews 1.061837 1 1.030455
review_scores_rating 1.047441 1 1.023446
room_type 12.943643 3 1.532296
model2 %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate),
std.error = exp(std.error),
conf.low=exp(conf.low),
conf.high=exp(conf.high)
) %>%
kbl(col.names=c("Variable",
"Estimate",
"SE",
"t-stat",
"p-value",
"Lower CI",
"Upper CI")
) %>%
kable_styling()| Variable | Estimate | SE | t-stat | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 347.7648012 | 1.082191 | 74.0815208 | 0.0000000 | 297.8766839 | 406.0081352 |
| prop_type_simplifiedEntire rental unit | 0.9778766 | 1.049262 | -0.4652412 | 0.6417770 | 0.8899052 | 1.0745443 |
| prop_type_simplifiedEntire serviced apartment | 1.3518788 | 1.058807 | 5.2761872 | 0.0000001 | 1.2086129 | 1.5121271 |
| prop_type_simplifiedOther | 0.9884007 | 1.063695 | -0.1889427 | 0.8501446 | 0.8757105 | 1.1155924 |
| prop_type_simplifiedPrivate room in rental unit | 0.6362133 | 1.070603 | -6.6286925 | 0.0000000 | 0.5565696 | 0.7272537 |
| number_of_reviews | 0.9991508 | 1.000076 | -11.1725409 | 0.0000000 | 0.9990019 | 0.9992998 |
| review_scores_rating | 1.1131383 | 1.014329 | 7.5337097 | 0.0000000 | 1.0825209 | 1.1446216 |
| room_typeHotel room | 0.9623769 | 1.064049 | -0.6177238 | 0.5367828 | 0.8520986 | 1.0869273 |
| room_typePrivate room | 0.5311568 | 1.048137 | -13.4574829 | 0.0000000 | 0.4843899 | 0.5824389 |
| room_typeShared room | 0.2865566 | 1.095642 | -13.6830693 | 0.0000000 | 0.2395765 | 0.3427493 |
According to p-value, room_type variables are all significant but room_typeHotel room. We see that changes to Hotel, Private or Shared room will result in a decrease in price of 3.8%, 46.9% and 71.4% respectively. This is coherent with what we would expect.
model2 %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("R Squared",
"Adj. R Squared",
"Sigma",
"t-stat",
"p-value",
"Df")
) %>%
kable_styling()| R Squared | Adj. R Squared | Sigma | t-stat | p-value | Df |
|---|---|---|---|---|---|
| 0.4733089 | 0.4724534 | 0.4718483 | 553.2664 | 0 | 9 |
autoplot(model2,
alpha = 0.2,
label.size = 3) +
theme_minimal()The adjusted Rsquared of model2 is 0.4718, which means that the selected variables are able to explain 47.18%% of the variation in prices. This is an improvement compared to model1 (43.95%). Residuals of model2 do not obey Normal Distribution well, we can see some patterns as for model2, which let us think that we might still be missing key explanatory variables for the prediction of the price.
#Constructing our predictions for the model and RMSE train from training data
train_rmse <-
airbnb_listing_train %>%
mutate(predictions = exp(predict(model2, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
airbnb_listing_test %>%
mutate(predictions = exp(predict(model2, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Training model output on RSquared
train_r_squared <-
summary(model2)$r.squared
#Testing model output on Rsquared
test_r_squared <-
cor(predict(model2, airbnb_listing_test), airbnb_listing_test$price_4_nights)
#Creating a matrix from output values
kbl(matrix(c(train_rmse,
test_rmse,
train_r_squared,
test_r_squared),
nrow = 2,
dimnames = list(c("Training","Testing"),
c("RMSE","RSquared"))),
) %>%
kable_styling()| RMSE | RSquared | |
|---|---|---|
| Training | 3593.001 | 0.4733089 |
| Testing | 2551.835 | 0.4354227 |
Difference of RMSE between training data set and testing data set is not very large and the training Rsquared is higher than the testing one.
Our dataset has many more variables, so we are trying to find whether other variables are significant predictors of price_4_nights.
We want to know if bathrooms, bedrooms, beds, accomodates are significant predictors of price_4_nights. We fit model3 with a regression model called model1 with logprice_4_nights as the outcome variable and prop_type_simplified, number_of_reviews, review_scores_rating, room_rype, bathrooms, bedrooms, beds and accomodatesas explanatory variables.
model3 <- lm(logprice_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bathrooms + bedrooms + beds + accommodates,
data = barcelona_analysis)
car::vif(model3) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 13.313991 4 1.382097
number_of_reviews 1.067452 1 1.033176
review_scores_rating 1.052516 1 1.025922
room_type 14.141768 3 1.555072
bathrooms 1.825769 1 1.351210
bedrooms 4.299252 1 2.073464
beds 4.399205 1 2.097428
accommodates 5.699291 1 2.387319
model3 %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate),
std.error = exp(std.error),
conf.low=exp(conf.low),
conf.high=exp(conf.high)
) %>%
kbl(col.names=c("Variable",
"Estimate",
"SE",
"t-stat",
"p-value",
"Lower CI",
"Upper CI")
) %>%
kable_styling()| Variable | Estimate | SE | t-stat | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 179.9050961 | 1.074915 | 71.8758554 | 0.0000000 | 156.1485040 | 207.2760403 |
| prop_type_simplifiedEntire rental unit | 0.9609921 | 1.043610 | -0.9321436 | 0.3513030 | 0.8838485 | 1.0448689 |
| prop_type_simplifiedEntire serviced apartment | 1.3452082 | 1.052042 | 5.8452345 | 0.0000000 | 1.2178553 | 1.4858785 |
| prop_type_simplifiedOther | 0.9625259 | 1.056353 | -0.6966919 | 0.4860249 | 0.8644448 | 1.0717353 |
| prop_type_simplifiedPrivate room in rental unit | 0.6267396 | 1.062494 | -7.7075651 | 0.0000000 | 0.5565151 | 0.7058255 |
| number_of_reviews | 0.9993282 | 1.000068 | -9.9360345 | 0.0000000 | 0.9991957 | 0.9994607 |
| review_scores_rating | 1.1210482 | 1.012733 | 9.0308484 | 0.0000000 | 1.0935836 | 1.1492026 |
| room_typeHotel room | 1.2218695 | 1.057051 | 3.6115794 | 0.0003070 | 1.0959408 | 1.3622679 |
| room_typePrivate room | 0.7090163 | 1.043362 | -8.1011432 | 0.0000000 | 0.6524039 | 0.7705412 |
| room_typeShared room | 0.2876998 | 1.086365 | -15.0395481 | 0.0000000 | 0.2445754 | 0.3384280 |
| bathrooms | 1.1447843 | 1.011703 | 11.6212057 | 0.0000000 | 1.1189676 | 1.1711967 |
| bedrooms | 1.0438326 | 1.010597 | 4.0695368 | 0.0000478 | 1.0224826 | 1.0656284 |
| beds | 0.9931130 | 1.005456 | -1.2700306 | 0.2041271 | 0.9825754 | 1.0037636 |
| accommodates | 1.0698846 | 1.005727 | 11.8297469 | 0.0000000 | 1.0579748 | 1.0819286 |
According to the p-value, beds is non-significant variable and we should remove it from the model. Other variables are significant. An increase (by 1) in the number of accomodates, bedrooms and bathroomsall result in an increase in price of 6.99%, 4.38% and 14.48% respectively.
To make GVIFs comparable across variables, we can use the Square of GVIF^(1/(2*DF)). This reduces GVIF to a linear measure. If the calculated value is less than 5, then we can conclude that their is no collinearity. accomodates has Square of (2.352034)^2 = 5.7 >5, indicating collinearity. Therefore, we need to remove it from the model.
model3 %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("R Squared",
"Adj. R Squared",
"Sigma",
"t-stat",
"p-value",
"Df")
) %>%
kable_styling()| R Squared | Adj. R Squared | Sigma | t-stat | p-value | Df |
|---|---|---|---|---|---|
| 0.5857428 | 0.5847702 | 0.4186164 | 602.2379 | 0 | 13 |
autoplot(model3,
alpha = 0.2,
label.size = 3) +
theme_minimal() The adjusted Rsquared of model3 is 0.5848, which means that the selected variables are able to explain 58.48% of the variation in prices. This is an improvement compared to model2 (47.18%). Residuals of model2 roughly obey Normal Distribution well, we can see less patterns than for model2, but we might still be missing explanatory variables to improve the prediction of the price.
#Constructing our predictions for the model and RMSE train from training data
train_rmse <-
airbnb_listing_train %>%
mutate(predictions = exp(predict(model3, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
airbnb_listing_test %>%
mutate(predictions = exp(predict(model3, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Training model output on RSquared
train_r_squared <-
summary(model3)$r.squared
#Testing model output on Rsquared
test_r_squared <-
cor(predict(model3, airbnb_listing_test), airbnb_listing_test$price_4_nights)
#Creating a matrix from output values
kbl(matrix(c(train_rmse,
test_rmse,
train_r_squared,
test_r_squared),
nrow = 2,
dimnames = list(c("Training","Testing"),
c("RMSE","RSquared"))),
) %>%
kable_styling()| RMSE | RSquared | |
|---|---|---|
| Training | 2710.236 | 0.5857428 |
| Testing | 1897.556 | 0.5968172 |
Difference of RMSE between training data set and testing data set is very small and the testing Rsquared is higher than the training one, which might imply slight overfitting, i.e. our model is becoming very complex.
We want to know if host_is_superhostcommands a pricing premium, after controlling for other variables. We fit model4 with a regression model called model4 with logprice_4_nights as the outcome variable and prop_type_simplified, number_of_reviews, review_scores_rating, room_rype, bathrooms, and host_is_superhostas explanatory variables.
model4 <- lm(logprice_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bathrooms + host_is_superhost,
data = barcelona_analysis)
car::vif(model4) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 13.057355 4 1.378738
number_of_reviews 1.129981 1 1.063006
review_scores_rating 1.113245 1 1.055104
room_type 13.172275 3 1.536774
bathrooms 1.059876 1 1.029503
host_is_superhost 1.159588 1 1.076841
model4 %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate),
std.error = exp(std.error),
conf.low=exp(conf.low),
conf.high=exp(conf.high)
) %>%
kbl(col.names=c("Variable",
"Estimate",
"SE",
"t-stat",
"p-value",
"Lower CI",
"Upper CI")
) %>%
kable_styling()| Variable | Estimate | SE | t-stat | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 240.0513800 | 1.077429 | 73.4915368 | 0.0000000 | 207.4003143 | 277.8427083 |
| prop_type_simplifiedEntire rental unit | 0.9951711 | 1.045371 | -0.1090917 | 0.9131337 | 0.9122628 | 1.0856142 |
| prop_type_simplifiedEntire serviced apartment | 1.3201377 | 1.054113 | 5.2701927 | 0.0000001 | 1.1905609 | 1.4638173 |
| prop_type_simplifiedOther | 0.9454133 | 1.058589 | -0.9858815 | 0.3242343 | 0.8455637 | 1.0570539 |
| prop_type_simplifiedPrivate room in rental unit | 0.5977301 | 1.064920 | -8.1814936 | 0.0000000 | 0.5283878 | 0.6761723 |
| number_of_reviews | 0.9992442 | 1.000072 | -10.4643353 | 0.0000000 | 0.9991027 | 0.9993857 |
| review_scores_rating | 1.0939414 | 1.013603 | 6.6455171 | 0.0000000 | 1.0653468 | 1.1233035 |
| room_typeHotel room | 1.1047255 | 1.059039 | 1.7362935 | 0.0825675 | 0.9872276 | 1.2362079 |
| room_typePrivate room | 0.6264124 | 1.044590 | -10.7222237 | 0.0000000 | 0.5750683 | 0.6823408 |
| room_typeShared room | 0.2915437 | 1.087809 | -14.6445338 | 0.0000000 | 0.2471986 | 0.3438437 |
| bathrooms | 1.3346095 | 1.009247 | 31.3580852 | 0.0000000 | 1.3107429 | 1.3589106 |
| host_is_superhostTRUE | 1.0448375 | 1.015416 | 2.8670650 | 0.0041586 | 1.0139672 | 1.0766476 |
According to the p-value, host_is_superhostis significant. It also commands a small pricing premium, after controlling for other variables (4.48%).
model4 %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("R Squared",
"Adj. R Squared",
"Sigma",
"t-stat",
"p-value",
"Df")
) %>%
kable_styling()| R Squared | Adj. R Squared | Sigma | t-stat | p-value | Df |
|---|---|---|---|---|---|
| 0.5532422 | 0.5523549 | 0.4346491 | 623.5651 | 0 | 11 |
autoplot(model4,
alpha = 0.2,
label.size = 3) +
theme_minimal() The adjusted Rsquared of model4 is 0.5524, which means that the selected variables are able to explain 55.24% of the variation in prices. This is a decrease compared to model3 (58.48%), meaning that
host_is_superhost should probably not be included. Residuals of model4 display patterns again, which let us think that we might still be missing key explanatory variables for the prediction of the price.
#Constructing our predictions for the model and RMSE train from training data
train_rmse <-
airbnb_listing_train %>%
mutate(predictions = exp(predict(model4, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
airbnb_listing_test %>%
mutate(predictions = exp(predict(model4, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Training model output on RSquared
train_r_squared <-
summary(model4)$r.squared
#Testing model output on Rsquared
test_r_squared <-
cor(predict(model4, airbnb_listing_test), airbnb_listing_test$price_4_nights)
#Creating a matrix from output values
kbl(matrix(c(train_rmse,
test_rmse,
train_r_squared,
test_r_squared),
nrow = 2,
dimnames = list(c("Training","Testing"),
c("RMSE","RSquared"))),
) %>%
kable_styling()| RMSE | RSquared | |
|---|---|---|
| Training | 3087.457 | 0.5532422 |
| Testing | 1940.272 | 0.5633311 |
Difference of RMSE between training data set and testing data set is very small and the testing Rsquared is higher than the training one, which might imply slight overfitting, i.e. our model is becoming very complex.
model5 <- lm(logprice_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bathrooms + instant_bookable,
data = barcelona_analysis)
car::vif(model5) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 13.136632 4 1.379782
number_of_reviews 1.069493 1 1.034163
review_scores_rating 1.049064 1 1.024238
room_type 13.249696 3 1.538276
bathrooms 1.061395 1 1.030240
instant_bookable 1.029694 1 1.014738
model5 %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate),
std.error = exp(std.error),
conf.low=exp(conf.low),
conf.high=exp(conf.high)
) %>%
kbl(col.names=c("Variable",
"Estimate",
"SE",
"t-stat",
"p-value",
"Lower CI",
"Upper CI")
) %>%
kable_styling()| Variable | Estimate | SE | t-stat | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 225.0470317 | 1.077086 | 72.9376580 | 0.0000000 | 194.5582837 | 260.3135961 |
| prop_type_simplifiedEntire rental unit | 0.9828325 | 1.045267 | -0.3911397 | 0.6957090 | 0.9011284 | 1.0719445 |
| prop_type_simplifiedEntire serviced apartment | 1.3020667 | 1.054034 | 5.0157923 | 0.0000005 | 1.1744365 | 1.4435671 |
| prop_type_simplifiedOther | 0.9365372 | 1.058506 | -1.1531391 | 0.2489031 | 0.8377534 | 1.0469691 |
| prop_type_simplifiedPrivate room in rental unit | 0.5991242 | 1.064860 | -8.1518208 | 0.0000000 | 0.5296792 | 0.6776739 |
| number_of_reviews | 0.9993097 | 1.000070 | -9.8324013 | 0.0000000 | 0.9991721 | 0.9994473 |
| review_scores_rating | 1.1063802 | 1.013190 | 7.7148642 | 0.0000000 | 1.0783209 | 1.1351696 |
| room_typeHotel room | 1.0791701 | 1.059163 | 1.3255666 | 0.1850380 | 0.9641685 | 1.2078884 |
| room_typePrivate room | 0.6218689 | 1.044571 | -10.8935827 | 0.0000000 | 0.5709173 | 0.6773677 |
| room_typeShared room | 0.2893103 | 1.087699 | -14.7535725 | 0.0000000 | 0.2453535 | 0.3411424 |
| bathrooms | 1.3365026 | 1.009245 | 31.5182290 | 0.0000000 | 1.3126068 | 1.3608334 |
| instant_bookableTRUE | 1.0522043 | 1.011955 | 4.2819456 | 0.0000188 | 1.0279738 | 1.0770059 |
instant_bookable is significant according to its p-value. Also a change in that logical variable from False to True results in and 5.22% increase in price.
model5 %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("R Squared",
"Adj. R Squared",
"Sigma",
"t-stat",
"p-value",
"Df")
) %>%
kable_styling()| R Squared | Adj. R Squared | Sigma | t-stat | p-value | Df |
|---|---|---|---|---|---|
| 0.5540553 | 0.5531697 | 0.4342534 | 625.6203 | 0 | 11 |
autoplot(model5,
alpha = 0.2,
label.size = 3) +
theme_minimal() The adjusted Rsquared of model5 is 0.5532, which means that the selected variables are able to explain 55.32% of the variation in prices. This is a slight increase compared to model4 (55.24%), meaning that
instant_bookable should probably be included. Residuals of model5 still display patterns, which let us think that we might still be missing key explanatory variables for the prediction of the price.
#Constructing our predictions for the model and RMSE train from training data
train_rmse <-
airbnb_listing_train %>%
mutate(predictions = exp(predict(model5, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
airbnb_listing_test %>%
mutate(predictions = exp(predict(model5, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Training model output on RSquared
train_r_squared <-
summary(model5)$r.squared
#Testing model output on Rsquared
test_r_squared <-
cor(predict(model5, airbnb_listing_test), airbnb_listing_test$price_4_nights)
#Creating a matrix from output values
kbl(matrix(c(train_rmse,
test_rmse,
train_r_squared,
test_r_squared),
nrow = 2,
dimnames = list(c("Training","Testing"),
c("RMSE","RSquared"))),
) %>%
kable_styling()| RMSE | RSquared | |
|---|---|---|
| Training | 3082.670 | 0.5540553 |
| Testing | 1933.803 | 0.5639097 |
Difference of RMSE between training data set and testing data set is very small and the testing Rsquared is higher than the training one, which might imply slight overfitting, i.e. our model is becoming very complex.
model6 <- lm(logprice_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bathrooms + instant_bookable + neighbourhood_simplified,
data = barcelona_analysis)
car::vif(model6) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 13.339522 4 1.382428
number_of_reviews 1.070205 1 1.034507
review_scores_rating 1.050618 1 1.024997
room_type 13.444714 3 1.542027
bathrooms 1.072098 1 1.035421
instant_bookable 1.032118 1 1.015932
neighbourhood_simplified 1.041735 3 1.006838
model6 %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate),
std.error = exp(std.error),
conf.low=exp(conf.low),
conf.high=exp(conf.high)
) %>%
kbl(col.names=c("Variable",
"Estimate",
"SE",
"t-stat",
"p-value",
"Lower CI",
"Upper CI")
) %>%
kable_styling()| Variable | Estimate | SE | t-stat | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 248.8618797 | 1.076417 | 74.9191495 | 0.0000000 | 215.4090132 | 287.5099525 |
| prop_type_simplifiedEntire rental unit | 0.9817790 | 1.044631 | -0.4211485 | 0.6736630 | 0.9012362 | 1.0695197 |
| prop_type_simplifiedEntire serviced apartment | 1.2799448 | 1.053309 | 4.7522225 | 0.0000021 | 1.1560392 | 1.4171307 |
| prop_type_simplifiedOther | 0.9761745 | 1.057768 | -0.4293704 | 0.6676704 | 0.8744049 | 1.0897888 |
| prop_type_simplifiedPrivate room in rental unit | 0.6248330 | 1.064002 | -7.5804872 | 0.0000000 | 0.5532818 | 0.7056373 |
| number_of_reviews | 0.9993208 | 1.000069 | -9.8121777 | 0.0000000 | 0.9991851 | 0.9994565 |
| review_scores_rating | 1.1011566 | 1.013009 | 7.4552807 | 0.0000000 | 1.0736055 | 1.1294148 |
| room_typeHotel room | 1.0044313 | 1.058614 | 0.0776248 | 0.9381293 | 0.8983072 | 1.1230927 |
| room_typePrivate room | 0.5939059 | 1.044081 | -12.0786472 | 0.0000000 | 0.5457471 | 0.6463143 |
| room_typeShared room | 0.2655677 | 1.086686 | -15.9490452 | 0.0000000 | 0.2256302 | 0.3125743 |
| bathrooms | 1.3214794 | 1.009158 | 30.5772085 | 0.0000000 | 1.2980723 | 1.3453086 |
| instant_bookableTRUE | 1.0453513 | 1.011796 | 3.7820136 | 0.0001572 | 1.0215926 | 1.0696625 |
| neighbourhood_simplifiedCoastal line outside Center | 0.8679975 | 1.014634 | -9.7443160 | 0.0000000 | 0.8436250 | 0.8930742 |
| neighbourhood_simplifiedNorth | 0.8170994 | 1.024871 | -8.2222764 | 0.0000000 | 0.7786801 | 0.8574143 |
| neighbourhood_simplifiedWest | 0.8676094 | 1.019039 | -7.5298560 | 0.0000000 | 0.8361169 | 0.9002881 |
The neighbourhood variables are all significant according to p-value. Changing to North, Coastal outside Center or West will induce a decrease in price of 18.29%, 13.20%, 13.24% respectively.
model6 %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("R Squared",
"Adj. R Squared",
"Sigma",
"t-stat",
"p-value",
"Df")
) %>%
kable_styling()| R Squared | Adj. R Squared | Sigma | t-stat | p-value | Df |
|---|---|---|---|---|---|
| 0.5670004 | 0.5659054 | 0.4280201 | 517.8023 | 0 | 14 |
autoplot(model6,
alpha = 0.2,
label.size = 3) +
theme_minimal() The adjusted Rsquared of model6 is 0.5659, which means that the selected variables are able to explain 56.59% of the variation in prices. This is a slight increase compared to model5 (55.32%), meaning that
neighbourhood variables should probably be included. Residuals of model5 still display patterns but less, which let us think that we might still be missing key explanatory variables for the prediction of the price.
#Constructing our predictions for the model and RMSE train from training data
train_rmse <-
airbnb_listing_train %>%
mutate(predictions = exp(predict(model6, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
airbnb_listing_test %>%
mutate(predictions = exp(predict(model6, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Training model output on RSquared
train_r_squared <-
summary(model6)$r.squared
#Testing model output on Rsquared
test_r_squared <-
cor(predict(model6, airbnb_listing_test), airbnb_listing_test$price_4_nights)
#Creating a matrix from output values
kbl(matrix(c(train_rmse,
test_rmse,
train_r_squared,
test_r_squared),
nrow = 2,
dimnames = list(c("Training","Testing"),
c("RMSE","RSquared"))),
) %>%
kable_styling()| RMSE | RSquared | |
|---|---|---|
| Training | 3006.856 | 0.5670004 |
| Testing | 1896.154 | 0.5705083 |
Difference of RMSE between training data set and testing data set is very small and the testing Rsquared is higher than the training one, which might imply slight overfitting, i.e. our model is becoming very complex.
We want to know the effect of avalability_30 or reviews_per_month on price_4_nights, after we control for other variables.
model7 <- lm(logprice_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type + bathrooms + instant_bookable + neighbourhood_simplified + availability_30 + reviews_per_month,
data = barcelona_analysis)
car::vif(model7) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 13.667818 4 1.386636
number_of_reviews 1.364425 1 1.168086
review_scores_rating 1.057701 1 1.028446
room_type 13.693264 3 1.546742
bathrooms 1.076702 1 1.037643
instant_bookable 1.035937 1 1.017810
neighbourhood_simplified 1.054034 3 1.008809
availability_30 1.055266 1 1.027261
reviews_per_month 1.327783 1 1.152295
model7 %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate),
std.error = exp(std.error),
conf.low=exp(conf.low),
conf.high=exp(conf.high)
) %>%
kbl(col.names=c("Variable",
"Estimate",
"SE",
"t-stat",
"p-value",
"Lower CI",
"Upper CI")
) %>%
kable_styling()| Variable | Estimate | SE | t-stat | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 222.2066430 | 1.074983 | 74.7334080 | 0.0000000 | 192.8400998 | 256.0452533 |
| prop_type_simplifiedEntire rental unit | 0.9694117 | 1.043443 | -0.7305194 | 0.4651037 | 0.8918716 | 1.0536932 |
| prop_type_simplifiedEntire serviced apartment | 1.2325031 | 1.051877 | 4.1333509 | 0.0000363 | 1.1161652 | 1.3609669 |
| prop_type_simplifiedOther | 0.9643810 | 1.056159 | -0.6637946 | 0.5068494 | 0.8664231 | 1.0734141 |
| prop_type_simplifiedPrivate room in rental unit | 0.6212936 | 1.062283 | -7.8772842 | 0.0000000 | 0.5518936 | 0.6994205 |
| number_of_reviews | 0.9997678 | 1.000076 | -3.0535164 | 0.0022725 | 0.9996187 | 0.9999169 |
| review_scores_rating | 1.1205088 | 1.012696 | 9.0191881 | 0.0000000 | 1.0931367 | 1.1485663 |
| room_typeHotel room | 0.9186459 | 1.057236 | -1.5245670 | 0.1274243 | 0.8236856 | 1.0245538 |
| room_typePrivate room | 0.5982460 | 1.042917 | -12.2260312 | 0.0000000 | 0.5509389 | 0.6496152 |
| room_typeShared room | 0.2502034 | 1.084310 | -17.1166199 | 0.0000000 | 0.2134907 | 0.2932295 |
| bathrooms | 1.3083512 | 1.008927 | 30.2422213 | 0.0000000 | 1.2857541 | 1.3313455 |
| instant_bookableTRUE | 1.0511371 | 1.011495 | 4.3636186 | 0.0000130 | 1.0278476 | 1.0749543 |
| neighbourhood_simplifiedCoastal line outside Center | 0.8557375 | 1.014285 | -10.9835164 | 0.0000000 | 0.8322703 | 0.8798665 |
| neighbourhood_simplifiedNorth | 0.8197943 | 1.024236 | -8.2974532 | 0.0000000 | 0.7821975 | 0.8591982 |
| neighbourhood_simplifiedWest | 0.8626368 | 1.018544 | -8.0418359 | 0.0000000 | 0.8321172 | 0.8942758 |
| availability_30 | 1.0083305 | 1.000572 | 14.5063939 | 0.0000000 | 1.0072007 | 1.0094616 |
| reviews_per_month | 0.9583110 | 1.004137 | -10.3137990 | 0.0000000 | 0.9505858 | 0.9660990 |
avalability_30 and reviews_per_monthare both significant according to p-value. Surprisingly an increase in reviews_per_month results in a 4.17% decrease in price. Increase in availability only results in a very slight increase in price.
model7 %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("R Squared",
"Adj. R Squared",
"Sigma",
"t-stat",
"p-value",
"Df")
) %>%
kable_styling()| R Squared | Adj. R Squared | Sigma | t-stat | p-value | Df |
|---|---|---|---|---|---|
| 0.5904035 | 0.5892193 | 0.4163676 | 498.5536 | 0 | 16 |
autoplot(model7,
alpha = 0.2,
label.size = 3) +
theme_minimal() The adjusted Rsquared of model7 is 0.5892, which means that the selected variables are able to explain 58.92% of the variation in prices. This is a slight increase compared to model6 (56.59%), meaning that
avalability_30 and reviews_per_month should probably be included. Residuals of model7 show an improvement, close to Normal Distribution and less patterns, which is good, howver we might still be able to improve the price’s prediction.
#Constructing our predictions for the model and RMSE train from training data
train_rmse <-
airbnb_listing_train %>%
mutate(predictions = exp(predict(model7, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
airbnb_listing_test %>%
mutate(predictions = exp(predict(model7, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Training model output on RSquared
train_r_squared <-
summary(model7)$r.squared
#Testing model output on Rsquared
test_r_squared <-
cor(predict(model7, airbnb_listing_test), airbnb_listing_test$price_4_nights)
#Creating a matrix from output values
kbl(matrix(c(train_rmse,
test_rmse,
train_r_squared,
test_r_squared),
nrow = 2,
dimnames = list(c("Training","Testing"),
c("RMSE","RSquared"))),
) %>%
kable_styling()| RMSE | RSquared | |
|---|---|---|
| Training | 2884.986 | 0.5904035 |
| Testing | 1781.748 | 0.5759478 |
Difference of RMSE between training data set and testing data set is not very large and the training Rsquared is now higher than the testing one, which is a good sign.
best_model <- lm(logprice_4_nights ~ prop_type_simplified + review_scores_rating + room_type + bathrooms + bedrooms + accommodates + instant_bookable + neighbourhood_simplified + availability_30 + reviews_per_month + host_is_superhost + host_response_rate + host_acceptance_rate + review_scores_cleanliness + review_scores_checkin + review_scores_location + review_scores_value,
data = barcelona_analysis)
msummary(best_model) Estimate Std. Error
(Intercept) 5.1892383 0.0993132
prop_type_simplifiedEntire rental unit -0.0371555 0.0408635
prop_type_simplifiedEntire serviced apartment 0.2539258 0.0485931
prop_type_simplifiedOther 0.0018282 0.0524648
prop_type_simplifiedPrivate room in rental unit -0.4156010 0.0579952
review_scores_rating 0.1452613 0.0261472
room_typeHotel room 0.0369883 0.0537111
room_typePrivate room -0.3810986 0.0406229
room_typeShared room -1.3119238 0.0792264
bathrooms 0.1281109 0.0111453
bedrooms 0.0524808 0.0095358
accommodates 0.0527500 0.0046754
instant_bookableTRUE 0.0458857 0.0123039
neighbourhood_simplifiedCoastal line outside Center -0.1451596 0.0138394
neighbourhood_simplifiedNorth -0.1692862 0.0234309
neighbourhood_simplifiedWest -0.1303869 0.0177368
availability_30 0.0069505 0.0005574
reviews_per_month -0.0431788 0.0036257
host_is_superhostTRUE 0.0763412 0.0140296
host_response_rate -0.2639991 0.0388522
host_acceptance_rate 0.1193673 0.0294330
review_scores_cleanliness 0.0736749 0.0185538
review_scores_checkin -0.0626868 0.0175955
review_scores_location 0.0948212 0.0219234
review_scores_value -0.1114280 0.0232620
t value Pr(>|t|)
(Intercept) 52.251 < 2e-16 ***
prop_type_simplifiedEntire rental unit -0.909 0.363253
prop_type_simplifiedEntire serviced apartment 5.226 1.80e-07 ***
prop_type_simplifiedOther 0.035 0.972204
prop_type_simplifiedPrivate room in rental unit -7.166 8.73e-13 ***
review_scores_rating 5.556 2.90e-08 ***
room_typeHotel room 0.689 0.491071
room_typePrivate room -9.381 < 2e-16 ***
room_typeShared room -16.559 < 2e-16 ***
bathrooms 11.495 < 2e-16 ***
bedrooms 5.504 3.89e-08 ***
accommodates 11.283 < 2e-16 ***
instant_bookableTRUE 3.729 0.000194 ***
neighbourhood_simplifiedCoastal line outside Center -10.489 < 2e-16 ***
neighbourhood_simplifiedNorth -7.225 5.69e-13 ***
neighbourhood_simplifiedWest -7.351 2.25e-13 ***
availability_30 12.470 < 2e-16 ***
reviews_per_month -11.909 < 2e-16 ***
host_is_superhostTRUE 5.441 5.51e-08 ***
host_response_rate -6.795 1.20e-11 ***
host_acceptance_rate 4.056 5.07e-05 ***
review_scores_cleanliness 3.971 7.25e-05 ***
review_scores_checkin -3.563 0.000370 ***
review_scores_location 4.325 1.55e-05 ***
review_scores_value -4.790 1.71e-06 ***
Residual standard error: 0.3991 on 5526 degrees of freedom
Multiple R-squared: 0.6241, Adjusted R-squared: 0.6225
F-statistic: 382.3 on 24 and 5526 DF, p-value: < 2.2e-16
car::vif(best_model) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 13.972116 4 1.390458
review_scores_rating 4.944021 1 2.223516
room_type 14.707991 3 1.565281
bathrooms 1.842621 1 1.357432
bedrooms 3.869552 1 1.967118
accommodates 4.202473 1 2.049993
instant_bookable 1.306395 1 1.142976
neighbourhood_simplified 1.130983 3 1.020726
availability_30 1.090789 1 1.044408
reviews_per_month 1.114209 1 1.055561
host_is_superhost 1.156422 1 1.075371
host_response_rate 1.358279 1 1.165452
host_acceptance_rate 1.572721 1 1.254082
review_scores_cleanliness 2.508431 1 1.583803
review_scores_checkin 2.026785 1 1.423652
review_scores_location 1.854451 1 1.361782
review_scores_value 3.906312 1 1.976439
best_model %>%
tidy(conf.int = TRUE) %>%
mutate(estimate=exp(estimate),
std.error = exp(std.error),
conf.low=exp(conf.low),
conf.high=exp(conf.high)
) %>%
kbl(col.names=c("Variable",
"Estimate",
"SE",
"t-stat",
"p-value",
"Lower CI",
"Upper CI")
) %>%
kable_styling()| Variable | Estimate | SE | t-stat | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| (Intercept) | 179.3319056 | 1.104412 | 52.2512243 | 0.0000000 | 147.6058131 | 217.8771397 |
| prop_type_simplifiedEntire rental unit | 0.9635263 | 1.041710 | -0.9092598 | 0.3632527 | 0.8893504 | 1.0438888 |
| prop_type_simplifiedEntire serviced apartment | 1.2890762 | 1.049793 | 5.2255577 | 0.0000002 | 1.1719445 | 1.4179148 |
| prop_type_simplifiedOther | 1.0018299 | 1.053865 | 0.0348460 | 0.9722038 | 0.9039119 | 1.1103550 |
| prop_type_simplifiedPrivate room in rental unit | 0.6599435 | 1.059710 | -7.1661332 | 0.0000000 | 0.5890205 | 0.7394063 |
| review_scores_rating | 1.1563417 | 1.026492 | 5.5555156 | 0.0000000 | 1.0985624 | 1.2171598 |
| room_typeHotel room | 1.0376809 | 1.055180 | 0.6886525 | 0.4910709 | 0.9339741 | 1.1529030 |
| room_typePrivate room | 0.6831105 | 1.041459 | -9.3813712 | 0.0000000 | 0.6308195 | 0.7397361 |
| room_typeShared room | 0.2693015 | 1.082449 | -16.5591823 | 0.0000000 | 0.2305613 | 0.3145510 |
| bathrooms | 1.1366790 | 1.011208 | 11.4946330 | 0.0000000 | 1.1121129 | 1.1617878 |
| bedrooms | 1.0538824 | 1.009581 | 5.5035776 | 0.0000000 | 1.0343642 | 1.0737688 |
| accommodates | 1.0541661 | 1.004686 | 11.2825571 | 0.0000000 | 1.0445483 | 1.0638725 |
| instant_bookableTRUE | 1.0469548 | 1.012380 | 3.7293680 | 0.0001939 | 1.0220039 | 1.0725148 |
| neighbourhood_simplifiedCoastal line outside Center | 0.8648843 | 1.013936 | -10.4888346 | 0.0000000 | 0.8417348 | 0.8886704 |
| neighbourhood_simplifiedNorth | 0.8442672 | 1.023708 | -7.2249056 | 0.0000000 | 0.8063640 | 0.8839521 |
| neighbourhood_simplifiedWest | 0.8777557 | 1.017895 | -7.3512104 | 0.0000000 | 0.8477597 | 0.9088131 |
| availability_30 | 1.0069747 | 1.000557 | 12.4698972 | 0.0000000 | 1.0058750 | 1.0080757 |
| reviews_per_month | 0.9577402 | 1.003632 | -11.9090355 | 0.0000000 | 0.9509568 | 0.9645719 |
| host_is_superhostTRUE | 1.0793308 | 1.014128 | 5.4414237 | 0.0000001 | 1.0500498 | 1.1094283 |
| host_response_rate | 0.7679742 | 1.039617 | -6.7949653 | 0.0000000 | 0.7116532 | 0.8287526 |
| host_acceptance_rate | 1.1267837 | 1.029870 | 4.0555531 | 0.0000507 | 1.0636080 | 1.1937118 |
| review_scores_cleanliness | 1.0764568 | 1.018727 | 3.9708901 | 0.0000725 | 1.0380067 | 1.1163312 |
| review_scores_checkin | 0.9392376 | 1.017751 | -3.5626657 | 0.0003702 | 0.9073919 | 0.9722010 |
| review_scores_location | 1.0994623 | 1.022166 | 4.3251185 | 0.0000155 | 1.0532101 | 1.1477456 |
| review_scores_value | 0.8945558 | 1.023535 | -4.7901268 | 0.0000017 | 0.8546778 | 0.9362944 |
All variables but prop_type_simplifiedEntire rental unit, prop_type_simplifiedOther and room_typeHotel room are significant according to their p-values. Changes in these variables all result in either a decrease or an increase in price, which is more or less important.
best_model %>%
glance() %>%
select(1:6) %>%
kbl(col.names=c("R Squared",
"Adj. R Squared",
"Sigma",
"t-stat",
"p-value",
"Df")
) %>%
kable_styling()| R Squared | Adj. R Squared | Sigma | t-stat | p-value | Df |
|---|---|---|---|---|---|
| 0.6241258 | 0.6224933 | 0.3991483 | 382.322 | 0 | 24 |
autoplot(best_model,
alpha = 0.2,
label.size = 3) +
theme_minimal() The adjusted Rsquared of our best model is 0.6225, which means that the selected variables are able to explain 62.25% of the variation in prices. This is a very good prediction model, higher than all models 1 to 7 that we obtained before. Moreover, the residuals of the model approximately follow a Normal Distribution, there is no clear pattern in the residuals. Thus, it seems that this model includes all key outcome variables to obtain a regression model that provides a good prediction of the analysed price.
#Constructing our predictions for the model and RMSE train from training data
train_rmse <-
airbnb_listing_train %>%
mutate(predictions = exp(predict(best_model, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Constructing our predictions for the model and RMSE test from testing data
test_rmse <-
airbnb_listing_test %>%
mutate(predictions = exp(predict(best_model, .))) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
#Training model output on RSquared
train_r_squared <-
summary(best_model)$r.squared
#Testing model output on Rsquared
test_r_squared <-
cor(predict(best_model, airbnb_listing_test), airbnb_listing_test$price_4_nights)
#Creating a matrix from output values
kbl(matrix(c(train_rmse,
test_rmse,
train_r_squared,
test_r_squared),
nrow = 2,
dimnames = list(c("Training","Testing"),
c("RMSE","RSquared"))),
) %>%
kable_styling()| RMSE | RSquared | |
|---|---|---|
| Training | 2524.989 | 0.6241258 |
| Testing | 1695.712 | 0.6091837 |
Difference of RMSE between training data set and testing data set is not very large and the training Rsquared is higher than the testing one, thus we do not have overfitting issues in our model.
huxreg(model1, model2, model3, model4, model5, model6, model7, best_model,
coefs=c("Property Type - Entire Unit"="prop_type_simplifiedEntire rental unit",
"Property Type - Entire Unit"="prop_type_simplifiedEntire serviced apartment",
"Property Type - Entire Unit"="prop_type_simplifiedPrivate room in rental unit",
"Property Type - Entire Unit"="prop_type_simplifiedOther",
"Number of Reviews"="number_of_reviews",
"Rating Score" = "review_scores_rating",
"Room Type - Hotel" = "room_typeHotel room",
"Room Type - Private Room"="room_typePrivate room",
"Room Type - Shared Room"="room_typeShared room",
"Number of Bathrooms"="bathrooms",
"Number of Bedrooms"="bedrooms",
"Number of Beds"="beds",
"Accommodates"="accommodates",
"Host is Superhost - TRUE"="host_is_superhostTRUE",
"Instant Bookable - TRUE"="instant_bookableTRUE",
"Neighbourhood - Coastal Line outside Center" = "neighbourhood_simplifiedCoastal line outside Center",
"Neighbourhood - West"="neighbourhood_simplifiedWest",
"Neighbourhood - North"="neighbourhood_simplifiedNorth",
"Availability next 30 days" = "availability_30",
"Reviews per month" = "reviews_per_month"))| (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | |
|---|---|---|---|---|---|---|---|---|
| Property Type - Entire Unit | -0.022 | -0.022 | -0.040 | -0.005 | -0.017 | -0.018 | -0.031 | -0.037 |
| (0.050) | (0.048) | (0.043) | (0.044) | (0.044) | (0.044) | (0.043) | (0.041) | |
| Number of Reviews | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.000 ** | |
| (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | ||
| Rating Score | 0.109 *** | 0.107 *** | 0.114 *** | 0.090 *** | 0.101 *** | 0.096 *** | 0.114 *** | 0.145 *** |
| (0.015) | (0.014) | (0.013) | (0.014) | (0.013) | (0.013) | (0.013) | (0.026) | |
| Room Type - Hotel | -0.038 | 0.200 *** | 0.100 | 0.076 | 0.004 | -0.085 | 0.037 | |
| (0.062) | (0.055) | (0.057) | (0.057) | (0.057) | (0.056) | (0.054) | ||
| Room Type - Private Room | -0.633 *** | -0.344 *** | -0.468 *** | -0.475 *** | -0.521 *** | -0.514 *** | -0.381 *** | |
| (0.047) | (0.042) | (0.044) | (0.044) | (0.043) | (0.042) | (0.041) | ||
| Room Type - Shared Room | -1.250 *** | -1.246 *** | -1.233 *** | -1.240 *** | -1.326 *** | -1.385 *** | -1.312 *** | |
| (0.091) | (0.083) | (0.084) | (0.084) | (0.083) | (0.081) | (0.079) | ||
| Number of Bathrooms | 0.135 *** | 0.289 *** | 0.290 *** | 0.279 *** | 0.269 *** | 0.128 *** | ||
| (0.012) | (0.009) | (0.009) | (0.009) | (0.009) | (0.011) | |||
| Number of Bedrooms | 0.043 *** | 0.052 *** | ||||||
| (0.011) | (0.010) | |||||||
| Number of Beds | -0.007 | |||||||
| (0.005) | ||||||||
| Accommodates | 0.068 *** | 0.053 *** | ||||||
| (0.006) | (0.005) | |||||||
| Host is Superhost - TRUE | 0.044 ** | 0.076 *** | ||||||
| (0.015) | (0.014) | |||||||
| Instant Bookable - TRUE | 0.051 *** | 0.044 *** | 0.050 *** | 0.046 *** | ||||
| (0.012) | (0.012) | (0.011) | (0.012) | |||||
| Neighbourhood - Coastal Line outside Center | -0.142 *** | -0.156 *** | -0.145 *** | |||||
| (0.015) | (0.014) | (0.014) | ||||||
| Neighbourhood - West | -0.142 *** | -0.148 *** | -0.130 *** | |||||
| (0.019) | (0.018) | (0.018) | ||||||
| Neighbourhood - North | -0.202 *** | -0.199 *** | -0.169 *** | |||||
| (0.025) | (0.024) | (0.023) | ||||||
| Availability next 30 days | 0.008 *** | 0.007 *** | ||||||
| (0.001) | (0.001) | |||||||
| Reviews per month | -0.043 *** | -0.043 *** | ||||||
| (0.004) | (0.004) | |||||||
| N | 5551 | 5551 | 5551 | 5551 | 5551 | 5551 | 5551 | 5551 |
| R2 | 0.440 | 0.473 | 0.586 | 0.553 | 0.554 | 0.567 | 0.590 | 0.624 |
| logLik | -3871.626 | -3702.179 | -3035.706 | -3245.339 | -3240.282 | -3158.521 | -3004.303 | -2765.837 |
| AIC | 7759.251 | 7426.358 | 6101.412 | 6516.677 | 6506.565 | 6349.042 | 6044.605 | 5583.674 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||||||||
The best model is preferable to other choices, as it offers the highest Rsquared, does not present any overfitting issue and has residuals that do not show clear patterns. In conclusion, it includes the most important outcome variables and enables good prediction of the variable of interest which is the price.
In this last part, we use our best model to predict the total cost of an Airbnb accommodation for a visit in Barcelona over reading week. More specifically, we want Airbnb’s in Barcelona that are apartments with a private room, have at least 10 reviews, and an average rating of at least 90.
# we select: an Entire serviced apartment, which has at least 10 reviews, and an average rating of at least 4,5/5 to conduct our prediction
finally_done <- barcelona_analysis %>%
filter(property_type == 'Entire serviced apartment',
review_scores_rating >= 4.5,
number_of_reviews >= 10) %>%
# we take the exponential for prediction because we used log(y) in our model
mutate(prediction = exp(predict(best_model, .)))# we construct a confidence interval
upper_ci <- quantile(finally_done$prediction, 0.975)
lower_ci <- quantile(finally_done$prediction, 0.025)
# we calculate the mean
average_prediction <- mean(finally_done$prediction)
# we visualise
ggplot(finally_done, aes(x = prediction)) +
geom_histogram() +
geom_vline(xintercept = average_prediction, size = 0.7, color = 'red') +
geom_vline(xintercept = upper_ci, size = 0.7, color = 'blue') +
geom_vline(xintercept = lower_ci, size = 0.7, color = 'blue') +
labs(y = "count", x = "Price: 2 Guests for 4 Nights",
title = 'Estimate and CI for the Price of a Stay in a Barcelona AirBnB',
subtitle = 'Histogram of the Estimated Price') +
scale_x_continuous()+
theme_bw()+
NULLAfter reaching the conclusion of our case-study, we recommend the couple to travel to Barcelona and select a property that has at least a 4.5* rating, has more than 10 reviews in total, and is an entire serviced apartment. These are the best predictors of price and will give the most satisfying experience to the couple.
For future analysis, we would recommend introducing a following variable that we would call “proximity_to_center”. Based on our Heat and Cluster Map, we see that the closer the property is located to the center of Barcelona, the more expensive the 4 night rate is. We can also see on the cluster map, that the more central the location is the more properties there are available. Creating this variable could be done through coming up a new variable using the combination of longitude and latitude points. I believe that this would have significant explanatory power in predicting price.
We also have to be mindful, that the Covid-19 Pandemic distorted the data-set a lot, so in the future, we would want to exclude those datapoints from the set in order to exclude outliers and bias from our data set. To eliminate even more bias from the data-set it is imperative to understand that the houseing market, especially the rental one, is cyclical, with a lot of cyclicality, such as summer and christmas season and over the weekends and public holidays. This could be mitigate by introducing variables that account for cyclicality, or taking long-term period averages that could smooth out high and low periods in the market.
Finally, this is publicly sourced data, that is on the the market uploaded by all kinds of individuals, with a lot of oversight, but fake, scam listings can always make it through the filters. We do not know how these points are treated, nor whether they are removed from the dataset after they have been discovered. This leads to possibly false data, that could further bias our predictions.